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Abstract 

Background: In the past decade, transcriptome data have become an important component of many phylogenetic 
studies. They are a cost-effective source of protein-coding gene sequences, and have helped projects grow from a 
few genes to hundreds or thousands of genes. Phylogenetic studies now regularly include genes from newly 
sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a 
phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software 
tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been 
manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to 
reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often 
cannot be easily applied and evaluated in the context of other studies. 

Results: We present Agalma, an automated tool that constructs matrices for phylogenomic analyses. The user 
provides raw lllumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence 
matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments 
of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled 
genomes and transcriptomes, can also be incorporated in the analyses. Agalma is built on the BioLite bioinformatics 
framework, which tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, 
installs dependencies, logs version numbers and calls to external programs, and enables rich HTML reports for all 
stages of the analysis. Agalma includes a small test data set and a built-in test analysis of these data. In addition to 
describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for 
download at https://bitbucket.org/caseywdunn/agalma. 

Conclusions: Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as 
a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and 
extended. Agalma also facilitates methods development by providing a complete modular workflow, bundled with 
test data, that will allow further optimization of each step in the context of a full phylogenomic analysis. 
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Background 

Transcriptome data are fast becoming an important and 
cost effective component of phylogenetic studies [1-5]. 
The rapid fall in sequencing prices has contributed to 
the growing number of phylogenetic studies that integrate 
data from genomes and transcriptomes, often referred 
to as "phylogenomic" analyses. There is wide recognition 
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that adding data from a larger number of genes is neces- 
sary to address many open phylogenetic questions, though 
of course additional gene sequences alone will not be 
sufficient to resolve them all [6-8]. 

The primary impediments to wider adoption and fur- 
ther improvement of phylogenomic methods are the com- 
plexity of the analyses and the lack of integrated tools to 
conduct them. Each phylogenomic study requires many 
steps, the vast majority of which concern matrix con- 
struction rather than phylogenetic analysis itself. These 
steps include raw data filtering, assembly, identification 
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of ribosomal RNA, selection of transcript splice vari- 
ants, translation, identification of homologous sequences, 
identification of orthologous sequences, sequence align- 
ment, phylogenetic analysis, and summary of results. 
Implementing a phylogenomic analysis is not just a mat- 
ter of executing available tools for each of these steps. 
Among other challenges, results must be summarized 
across multiple steps, detailed records must be kept of 
all analysis steps, data files often need to be reformat- 
ted between analyses, and computational load must be 
balanced according to the available resources. 

Because phylogenomic studies are complex and have 
been manual or semi-automated, they are difficult to 
implement and explicitly describe, and require extensive 
technical effort to reproduce. These problems can make 
it difficult to evaluate results, integrate data across stud- 
ies, expand analyses, or test the impact of alternative 
analysis approaches. In addition, manual analyses often 
include many subjective decisions that may impact the 
final results. 

Some higher-level pipelines have addressed subsets of 
phylogenomic analyses. These tools include PartiGene 
[9], a pipeline to aid in the assembly and annotation of 
Sanger transcriptome data collected across a diversity of 
species, and SCaFoS [10], a semi-automated tool for the 
curation of super matrices from previously assembled 
transcriptomes. No existing tool, however, can execute a 
full phylogenomic analysis of modern sequence data. 

We addressed these needs by developing Agalma, an 
automated phylogenomics workflow. Using Agalma, an 
investigator can conduct complete phylogenomic analy- 
ses, from raw sequence reads to preliminary phylogenetic 
trees, with a small number of high-level commands. The 
results are accompanied by detailed reports that integrate 
diagnostic information across data sets and analysis steps. 
In a first pass with Agalma, the investigator conducts the 
analysis under default settings. The investigator then con- 
sults the reports to consider how best to optimize the 
analyses, and easily re-executes them with updated set- 
tings to produce new matrices and preliminary trees. The 
investigator can then analyze the optimized matrices with 
external phylogenetic inference tools not already included 
within Agalma to explore other factors, such as model 
sensitivity. 

Implementation 

We built Agalma with BioLite [11], a generic bioinfor- 
matics pipeline framework. BioLite profiles memory and 
CPU use, tracks provenance of all data and analyses, 
and centralizes diagnostic reporting. Agalma is a modu- 
lar workflow comprised of helper scripts and a series of 
pipelines. Each pipeline is made up of stages that call inter- 
nal analysis functions (many implemented with the help 
of Biopython [12]) and wrappers from the BioLite Python 



module. The wrappers invoke command-line tools, which 
include external bioinformatics tools such as the Bowtie2 
aligner [13] and Trinity assembler [14], as well as sev- 
eral C++ tools from BioLite. All calls to these external 
programs are logged, as are the version numbers of the 
programs and a hash of the binary so that it is possible 
to detect differences in compiler settings. This informa- 
tion about calls to external programs is available to the 
user via the agalma diagnostics command. We pro- 
vide automated installation tools for Agalma, BioLite, and 
all required third party software. As a result, complete 
installation of Agalma and all dependencies requires only 
a handful of commands on OS X and Ubuntu, and the 
scripts facilitate installation on many other UNIX com- 
pliant systems as well. Installation details are provided in 
the README and INSTALL files included with BioLite and 
Agalma. 

All interactions with Agalma are with the command 
agalma. The command agalma assemble, for exam- 
ple, provides access to the assemble pipeline. These 
commands include built-in help via the -h option, 
which provides details on all arguments and the param- 
eters for external packages that are exposed to the user 
through Agalma. The command agalma -h provides 
help for the top-level wrapper. The command agalma 
ass emb 1 e - h, for example, provides help for the assem- 
ble command. Additional details regarding the interface 
can be found in the README file. 

The first step for analyzing each data set, whether it 
consists of raw reads to be assembled or of previously 
assembled gene predictions, is to catalog the data. This 
creates a database entry that includes sample metadata 
and the paths to the data files. Agalma has built-in support 
for transcriptome assembly of pair-end Illumina data only 
in FASTQ format. When analyzing public data, raw reads 
and associated metadata can be imported directly from 
the NCBI Sequence Read Archives (SRA) using the com- 
mand sra import. This command downloads the reads 
for a given SRA accession number (experiment, study, 
sample, or run), converts them into FASTQ format, and 
populates the catalog with the corresponding data paths 
and metadata. When previously assembled gene predic- 
tions are used, the assembly files in FASTA format can be 
directly catalogued. 

There are several distinct tasks subsequent to cata- 
loging the data-sequence assembly, loading the genes into 
the database, and phylogenetic analysis. These tasks are 
described in detail in the README and TUTORIAL files 
provided with Agalma, and are briefly summarized below. 

Assembly 

The pipeline transcriptome runs an assembly from 
read filtering through assembly and preliminary annota- 
tion. In a typical analysis, transcriptome would be run 
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once for each species for which raw Illumina transcrip- 
tome data are available. The transcriptome pipeline 
executes the following sub-pipelines, which can also be 
run individually: 

• sanitize filters raw paired-end Illumina data and 
randomizes the order of reads (maintaining read 
order between paired files) to facilitate subsequent 
subsetting. Reads are discarded if they do not meet a 
specified mean quality threshold, if they contain 
Illumina adapter sequences, or if the base 
composition is highly skewed (if any base represents 
either < 5% or > 60% of the sequence). This pipeline 
also generates FastQC [15] summaries of read quality. 

• insert_size uses subassemblies and Bowtie2 
mapping to estimate the mean and variance of the 
insert size (i.e., the length of the fragment between 
the sequencing adapters). This information provides 
important feedback on the success of sample 
preparation, and is also used in some downstream 
analysis steps. 

• remove_rrna removes most reads that are derived 
from ribosomal RNA (rRNA). Removing rRNA in 
advance of assembly can reduce both the number of 
chimeric transcripts and the time required for 
assembly. This pipeline first assembles multiple 
subsets of reads. A range of subset sizes is used since 
the optimal number of reads for assembling a 
particular rRNA transcript depends upon multiple 
factors, including the fraction of reads that originate 
from rRNA and the uniformity of coverage across the 
rRNA transcripts (which can vary greatly, depending 
on how denatured the samples were prior to 
fragmentation). rRNA transcripts are then identified 
by blast comparison of these subassemblies to an 
included dataset of known rRNA sequences. The 
entire set of reads is then compared to the rRNA 
transcripts that are identified from the subassemblies, 
and any reads that Bowtie2 maps to them are 
removed. A plot of the distribution of reads across 
exemplar rRNA transcripts is shown to help evaluate 
rRNA assembly success. The top hit in the NCBI nt 
database is also provided as an independent check on 
sample identity and to help spot potential 
contaminants. The fraction of reads that derive from 
rRNA is also reported to aid in improving library 
preparations. 

• ass emb 1 e filters the reads at a higher quality 
threshold and assembles them. Assemblies can be 
conducted under multiple protocols (such as multiple 
assemblers, or the same assembler under different 
settings). This pipeline can also assemble multiple 
subsets of different numbers of reads, which provides 
perspective on how sequencing effort impacts 



assembly results. The default assembler is Trinity 
[16]. The wrapper we have included in Agalma for 
running Trinity makes two improvements over the 
wrapper script that comes with Trinity. First, we have 
added a filter in between the Chrysalis and Butterfly 
stages to remove components that are smaller than 
the minimum transcript length parameter passed to 
Butterfly, since running Butterfly on these 
components will not yield a transcript. For the five 
assemblies in our test data set, this reduces the 
number of Butterfly commands from roughly 100,000 
to 60,000. Second, we have replaced the ParaFly 
utility that is used for concurrent execution of the 
Butterfly commands with the GNU parallel tool 
[17] because it has better parallel efficiency. ParaFly 
executes the commands concurrently, but in blocks, 
so that the time to execute a block is the runtime of 
the slowest individual command. The runtimes can 
vary greatly because of variance in transcript length 
and complexity. In contrast, parallel load 
balances the commands across the available 
processors. 

• post assemble uses blastn against the rRNA 
reference sequences to identify rRNA transcripts 
(these could include low abundance transcripts, such 
as parasite contaminants, that were not removed as 
reads by remove_rrna) and screens against the 
NCBI UniVec database to identify vector 
contaminants (such as protein expression vector 
contaminants in the sample preparation enzymes, 
which we have encountered in multiple samples). 
The longest likely coding region per transcripts is 
identified and annotated using blastp against the 
NCBI SwissProt database. Reads are mapped to the 
assembly and all transcripts quantified using RSEM 
[18], and the splice variant with the highest 
expression is selected as the exemplar transcript for 
each gene for downstream analyses. It also produces a 
coverage map that helps assess the distribution of 
sequencing effort across genes. Finally, it uses 
blastx to compare all the transcripts against the 
NCBI SwissProt database to establish which are 
similar to previously known proteins. 

Following these steps, the investigator can inspect the 
assembled data directly or load them into the Agalma 
database to prepare them for phylogenetic analysis. 

Load genes into the local Agalma database 

Subsequent phylogenetic analyses require that all gene 
sequences to be considered are loaded into the local 
Agalma database. The load command takes care of 
this process. In a typical analysis, load is executed 
once for each dataset that has been assembled by 
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the transcriptome pipeline described above, and 
once for each set of gene predictions from external 
sources (e.g., externally assembled 454 transcriptome 
data or gene predictions from genome sequencing 
projects). 

Phylogenetic analysis 

Once assemblies for multiple species are loaded into the 
local Agalma database, the user carries out a phyloge- 
nomic analysis by consecutively executing the following 
pipelines: 

• homo 1 og i z e allows the user to specify which 
datasets to include in a particular phylogenetic 
analysis. It then uses an all-by-all tblastx search to 
build a graph with edges representing hits above a 
stringent threshold, and breaks the graph into 
connected components corresponding to clusters of 
homologous sequences with the Markov Clustering 
Algorithm (MCL) tool [19]. 

• mu 1 1 a 1 i gn applies sampling and length filters to 
each cluster of homologous sequences, and uses 
another all-by-all tblastx within each cluster to 
trim sequence ends that have no similarity to other 
sequences in the cluster (these could include, for 
example, chimeric regions). The sequences of each 
cluster are aligned using MACSE [20], a 
translation-aware multiple-sequence aligner that 
accounts for frameshifts and stop codons. Multiple 
sequence alignments are then cleaned with 

G BLOCKS [21]. Optionally, the alignments can be 
concatenated together to form a supermatrix. 

• gene tree uses RAxML [22] to build a maximum 
likelihood phylogenetic tree for each cluster of 
homologous sequences. Gene trees can be filtered 
according to mean bootstrap support, which 
eliminates genes that have little phylogenetic signal 
[23] and reduces overall computational burden. This 
filter can be applied prior to running treeprune 
(described below), which has the added advantage of 
restricting ortholog selection to only well-supported 
gene trees. All options available in RAxML can be 
passed as optional arguments. If the input is a 
supermatrix consisting of concatenated orthologs, it 
builds a species tree. 

• treeprune identifies orthologs according to the 
topology of gene phylogenies, using a new 
implementation of the method introduced in a 
previous phylogenomic study [24]. It uses DendroPy 
[25] to prune each gene tree into maximally inclusive 
subtrees with no more than one sequence per taxon. 
Each of these subtrees is considered as a set of 
putative orthologs. The pruned subtrees are 
re-entered as clusters into Agalmas database. 



After treeprune, the user can build a superma- 
trix and a preliminary maximum likelihood species 
tree with RAxML. These steps, which include rerun- 
ning multalign and genet ree on the orthologs, are 
detailed in the Agalma TUTORIAL file. Agalma pro- 
duces a partitions file that describes which regions of the 
supermatrix correspond to which genes. The user can 
then proceed with more extensive phylogenetic analyses 
of the supermatrix using external phylogenetic inference 
software of their choice (only RAxML is included with 
Agalma at this time). As the alignments for each gene are 
also provided, the investigator can also apply promising 
new approaches that simultaneously infer gene trees and 
species trees [26]. 

Test data and analyses 

A small set of test data is provided with Agalma. It consists 
of 25,000 lOObp Illumina read pairs for the siphonophore 
Hippopodius hippopus, a subset of 72-74 gene sequences 
assembled for each of five siphonophores, and a subset 
of 40 gene predictions from the Nemato stella vectensis 
genome assembly. These data were chosen because they 
run relatively quickly and enable testing of most com- 
monly used features. 

These test data serve several purposes. They allow a user 
to validate that Agalma is working correctly, and users are 
strongly encouraged to run this test with the command 
agalma test right after installation. This test analysis 
takes about 20 minutes on a 2-core 2 GHz computer. The 
test data also serve as the foundation for the example anal- 
ysis described in the TUTORIAL file. For the developer, 
the agalma test command serves as a regression test 
to check if changes break existing features. We routinely 
run this test in the course of adding and refactoring code. 
The test data also serve as a minimal case study for devel- 
oping new features without needing to first download and 
curate data. 

Results and discussion 

In addition to the small test data sets included with 
Agalma, here we present an example analysis of larger data 
sets from seven species. Though most phylogenetic anal- 
yses would include more taxa than this simple example 
case, the size of the dataset for each species is typical for 
contemporary phylogenomic analyses. This seven-taxon 
data set consists of new raw Illumina reads for five species 
of siphonophores, Abylopsis tetragona, Agalma elegans, 
Nanomia bijuga. Physalia physalis, Craseoa sp., and gene 
predictions for two outgroup taxa, Nemato stella vectensis 
[27] and Hydra magnipapillata [28], produced by previ- 
ous genome projects. mRNA for the five siphonophore 
samples was isolated with Dynabeads mRNA DIRECT 
(Life Technologies) and prepared for sequencing with the 
TruSeq RNA Sample Prep Kit (Illumina). The sample 
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preparation was modified by including a size selection 
step (agarose gel cut) prior to PCR amplification. Analyses 
were conducted with Agalma at git commit dc54 9d2 3 
and BioLite at git commit 025fe65e (versions 0.3.4 with 
patches). 

We deposited the new data in a public archive 
(NCBI Sequence Read Archive, BioProject PRJNA205486) 
prior to running the final analysis. A git repository 
of the scripts we used to execute the example anal- 
yses is available at https://bitbucket.org/caseywdunn/ 
dunnhowisonzapata2013. These scripts download the 
data from the public archives, execute the analyses, and 
generate the analysis reports. All of the figures presented 
here are taken from the analysis reports generated by 
Agalma. This illustrates how a fully reproducible and open 
phylogenomic analysis can be implemented and commu- 
nicated with Agalma. These scripts can be used as they are 
to repeat the analyses. They could also be modified to try 
alternative analysis strategies on these same data, or they 
could be adapted to run similar analyses on different data. 



Assembly 

The tabular assembly report (index . html in Additional 
file 1) summarizes assembly statistics across samples, and 
links to more detailed assembly reports for each sample. 
For the example analysis, this summary indicates, among 
other things, that the fraction of rRNA in each library 
ranged from 0.4% to 27.2% and the insert sizes were on 
average 266 bp long. The detailed assembly reports have 
extensive diagnostics that pertain to sample quality and 
the success of library preparation. As an example, Figure 1 
shows several of the plots from the detailed assembly 
report for Agalma elegans, the siphonophore after which 
our tool is named. The distribution of sequencing effort 
across transcripts (Figure la) and the size distribution of 
transcripts (Figure lb) are typical for de novo Illumina 
transcriptome assemblies. 

Phylogenetic analyses 

The Agalma phylogeny report includes a plot of the 
number of sequences considered at each stage of the 
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Figure 1 Transcriptome assembly statistics for Agalma elegans. (a) Cumulative read coverage, sorted from the most frequently sequenced 
exemplar transcripts to the least frequently sequenced transcripts. Most reads map to a small fraction of the genes, as is typical for a transcriptome 
sequencing project, (b) The size distribution of the assembled transcripts, for all transcripts and for only those transcripts that have a blastx hit to a 
protein in the Swissprot database. Most small transcripts do not have blastx hits and likely include some non-coding RNA and many assembly errors. 
These are removed prior to phylogenetic analysis. See Additional file 1 for further assembly diagnostics. 
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Figure 2 The number of gene sequences under consideration at each stage of super matrix construction. Most gene sequences are 
eliminated in the first step of homology evaluation (the parse_edges stage of the homologize pipeline), due to low similarity to other 
sequences. Of the remaining sequences, many are eliminated during multiple sequence alignment (the multalign pipeline) in clusters failing 
taxon sampling and cluster size criteria reflecting uncertainty in homology across species, and during orthology evaluation (the treeprune 
pipeline) mainly reflecting poor sampling of some ortholog groups. See Additional file 2 for further diagnostics regarding matrix construction. 



analysis (Additional file 2). In the example analysis, the 
step that removed the most sequences was the first step 
of homology evaluation in the homologize pipeline 
(Figure 2). This reduction is due to low similarity to other 
sequences and clusters with poor taxon sampling. The 
next major reduction occurred in cluster refinement in 
multalign. This reduction is largely due to the elimi- 
nation of clusters that failed the taxon sampling criteria, 
and reflects uncertainty regarding the homology of some 
sequences and sparse sampling of some homologs. The 
next major reduction in the number of genes occurred 
in treeprune. These reductions are due to both uncer- 
tainty regarding orthology and poor sampling of some 



ortholog groups. The preliminary species tree for the 
example analysis (Figure 3) is congruent with previous 
analyses of siphonophore relationships [29] . 

Resource utilization 

Phylogenomic analyses are computationally intensive. 
Detailed information about resource utilization helps 
investigators plan resources for projects and balance com- 
putational load more efficiently. It is also critical for the 
optimization of the analyses, and can help guide design 
decisions. For each analysis, Agalma produces a resource 
utilization plot that displays the time and maximum 
memory used by external executables (Figure 4). The 
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Figure 3 The preliminary maximum likelihood phylogeny resulting from the example analysis. This unrooted tree was inferred from the 
protein supermatrix under the WAG + r model. 
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Resource Usage for Agalma elegans (Runs 3,12-13,18,32,38) 
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Figure 4 A profile of computational resource utilization for the transcriptome pipeline. This plot is from the report for the Agalma elegans 
assembly. Although annotation of transcripts (the calls to blastx and blastp during the postasemble stage) showed the highest processing 
time (top row), assembly (the call to Trinity during the assemble stage) displayed the highest memory use and it took the longest real time to 
finish (bottom row). 



peak memory use, and the longest-running step, in the 
transcriptome pipeline was the call to Trinity during 
the assemble stage. 

Conclusions 

A distinction is sometimes drawn between manual 
approaches that enable close user inspection of data and 
results, and automated approaches that isolate the user 
from their results. This is a false dichotomy-automating 
analyses and examining the results closely are not mutu- 
ally exclusive. Automated analyses with detailed diag- 
nostics provide the best of both worlds. The user has 
a very detailed perspective on their analysis, and the 
added efficiencies of automation leave the investigator 
with far more time to assess these results. Automa- 
tion also allow improvements made in the context of 



one study to be applied to other studies much more 
effectively. 

For a study to be fully reproducible, both the data and 
the analysis must be described explicitly and unambigu- 
ously. The best description of an analysis is the code that 
was used to execute the analysis. By automating phy- 
logenomic analyses from data download through matrix 
construction and preliminary phylogenetic trees, Agalma 
enables fully reproducible phylogenomic studies. This will 
allow reviewers and readers to reproduce an entire analy- 
sis exactly as it was run by the authors, without needing to 
re-curate the same dataset or rewrite the analysis code. 

There are alternative approaches to many of the steps 
in a phylogenomic analysis presented here. There are, for 
example, multiple tools that identify orthologs accord- 
ing to different methods and criteria [30,31]. Agalma is a 
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general framework and can be expanded to include these 
additional methods, and directly compare them in the 
context of a complete workflow that is consistent in all 
other regards. 

Availability and requirements 

Project name: Agalma 

Project home page: https://bitbucket.org/caseywdunn/ 
agalma 

Operating system(s): Linux, Mac OS X 
Programming language: Python 

Other requirements: BioLite, and the dependencies 
listed in the BioLite INSTALL file 
License: GNU 



Additional files 



Additional file 1 : HTML report for assembly of the sample data sets. 

The HTML report for the assembly of the test data sets from raw reads. The 
tabular report (index.html) provides an overview across the five assemblies 
for the ingroup taxa, and includes links (in the Catalog ID column) to 
detailed reports for the assembly of each species. Fasta files for the 
annotated transcripts have been removed from the report to reduce file 
size. 

Additional file 2: HTML report for phylogenetic analyses. The HTML 
report for the phylogenetic analysis of the sample data. 
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